Abstract

The digital libraries of the future will include not only
(ASCII) text information but scanned paper documents
as well as still photographs and videos. There is, therefore,
a need to index and retrieve information from such
multi-media collections. The Center for Intelligent Information
Retrieval (CIIR) has a number of projects to index
and retrieve multi-media information. These include:

1.  The  extraction  of  text  from  images  which  may  be 
used  both  for  finding  text  zones  against  general 
backgrounds  as  well  as  for  indexing  and  retrieving 
image  information. 

2. Indexing hand-written and poorly printed documents
using image matching techniques (word spotting).

3.  Indexing  images  using  their  content. 

1  Introduction 

The digital libraries of the future will include not only
(ASCII) text information but scanned paper documents
as well as still photographs and videos. There is, therefore,
a need to index and retrieve information from such
multi-media collections. The Center for Intelligent
Information Retrieval (CIIR) has a number of projects to
index and retrieve multi-media information. These include:

1. Finding Text in Images: The conversion of scanned
documents into ASCII so that they can be indexed
using INQUERY (CIIR's text retrieval engine).
Current Optical Character Recognition (OCR)
technology can convert scanned text to ASCII
but is limited to clean machine printed fonts
against clean backgrounds. Handwritten text, text
printed against shaded or textured backgrounds and
text embedded in images cannot be recognized well
(if it can be recognized at all) with existing OCR
technology. Many financial documents, for example,
print text against shaded backgrounds to prevent
copying.

The  Center  has  developed  techniques  to  detect  text 
in  images.  The  detected  text  is  then  cleaned  up 
and  binarized  and  run  through  a  commercial  OCR. 
Such  techniques  can  be  applied  to  zoning  text  found 
against  general  backgrounds  as  well  as  for  indexing 
and  retrieving  images  using  the  associated  text. 

2. Word Spotting: The indexing of hand-written and
poorly printed documents using image matching
techniques. Libraries hold vast collections of original
handwritten manuscripts, many of which have
never been published. Word Spotting can be used
to create indices for such handwritten manuscript
archives.

3. Image Retrieval: Indexing images using their content.
The Center has also developed techniques to
index and retrieve images by color and appearance.

2  Finding  Text  in  Images 

Most of the information available today is either on paper
or in the form of still photographs and videos. To
build digital libraries, this large volume of information
needs to be digitized into images and the text converted
to ASCII for storage, retrieval, and easy manipulation.
For example, video sequences of events such as a basketball
game can be annotated and indexed by extracting
a player’s number, name and the team name that appear
on the player’s uniform (Figure 1(b, c)). This may be
combined with methods for image indexing and retrieval
based on image content (see section 3).

Figure 1: The system, example input image, and extracted text. (a) The top level components of the text detection and extraction
system. The pyramid of the input image is shown as I, I1, I2, ...; (b) An example input image; (c) Output of the system before
being fed to the Character Recognition module.

Current OCR technology [1, 20] is largely restricted
to finding text printed against clean backgrounds, since
in these cases it is easy to binarize the input images to
extract text (text binarization) before character recognition
begins. It cannot handle text printed against
shaded  or  textured  backgrounds,  nor  text  embedded  in 
pictures.  More  sophisticated  text  reading  systems  usu¬ 
ally  employ  page  segmentation  schemes  to  identify  text 
regions.  Then  an  OCR  module  is  applied  only  to  the 
text  regions  to  improve  its  performance.  Some  of  these 
schemes  [32,  33,  21,  23]  are  top-down  approaches,  some 
are  bottom-up  methods  [7,  22],  and  others  are  based  on 
texture segmentation techniques in computer vision [8].
However,  the  top-down  and  bottom-up  approaches  usu¬ 
ally  require  the  input  image  to  be  binary  and  have  a  Man¬ 
hattan  layout.  Although  the  approach  in  [8]  can  in  prin¬ 
ciple  be  applied  to  greyscale  images,  it  was  only  used 
on  binary  document  images,  and  in  addition,  the  text 
binarization  problem  was  not  addressed.  In  summary, 
few  working  systems  have  been  reported  that  can  read 
text  from  document  pages  with  both  structured  and  non- 
structured  layouts.  A  brief  overview  of  a  system  devel¬ 
oped  at  CIIR  for  constructing  a  complete  automatic  text 
reading  system  is  presented  here  (for  more  details  see 
[34,  35]). 

2.1  System  Overview 

The  system  takes  advantage  of  the  following  distinctive 
characteristics  of  text  which  make  it  stand  out  from  other 
image information: (1) Text possesses distinctive frequency
and orientation attributes; (2) Text shows spatial
cohesion: characters of the same text string are of similar
heights, orientation and spacing.

The  first  characteristic  suggests  that  text  may  be 
treated  as  a  distinctive  texture,  and  thus  be  segmented 
out  using  texture  segmentation  techniques.  Thus,  the  first 
phase  of  our  system  is  Texture  Segmentation  as  shown  in 
Figure  1(a).  In  the  Chip  Generation  phase,  strokes  are 
extracted from the segmented text regions. Using reasonable
heuristics on text strings based on the second
characteristic,  the  extracted  strokes  are  then  processed  to 
form  tight  rectangular  bounding  boxes  around  the  corre¬ 
sponding  text  strings.  To  detect  text  over  a  wide  range 
of  font  sizes,  the  above  steps  are  applied  to  a  pyramid 
of  images  generated  from  the  input  image,  and  then  the 
boxes  formed  at  each  resolution  level  of  the  pyramid  are 
fused  at  the  original  resolution.  A  Text  Clean-up  mod¬ 
ule  which  removes  the  background  and  binarizes  the  de¬ 
tected  text  is  applied  to  extract  the  text  from  the  regions 
enclosed  by  the  bounding  boxes.  Finally,  text  bounding 
boxes  are  refined  (re-generated)  by  using  the  extracted 
items  as  strokes.  These  new  boxes  usually  bound  text 
strings  better.  The  Text  Clean-up  process  is  then  carried 
out  on  the  regions  bounded  by  these  new  boxes  to  extract 
cleaner  text,  which  can  then  be  passed  through  a  com¬ 
mercial  OCR  engine  for  recognition  if  the  text  is  of  an 
OCR-recognizable  font.  The  phases  of  the  system  are 
discussed  in  the  following  sections. 

2.2  The  Texture  Segmentation  Module 

A  standard  approach  to  texture  segmentation  is  to  first 
filter  the  image  using  a  bank  of  linear  filters  such  as 
Gaussian  derivatives  [11]  or  Gabor  functions,  followed 
by some non-linear transformation such as the hyperbolic
tangent tanh(αt). Then features are computed to form
a  feature  vector  for  each  pixel  from  the  filtered  im¬ 
ages.  These  feature  vectors  are  then  classified  to  seg¬ 
ment  the  textures  into  different  classes  (for  more  details 
see  [34,  35]). 
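
As a rough illustration of the filter, non-linearity and feature steps just described, the following Python sketch uses central differences as a simple stand-in for Gaussian derivative or Gabor filters, applies tanh with an assumed gain alpha, and averages the squared responses over a small window to form a feature vector for each pixel. The function names, the value alpha = 0.25 and the window radius are illustrative assumptions, not the parameters of the system in [34, 35], and the final classification of the feature vectors is left out.

```python
import math


def derivative_responses(img):
    """Central-difference x and y derivatives of a greyscale image.

    Assumption: this is only a stand-in for a bank of Gaussian
    derivative (or Gabor) filters. Border pixels are left at zero.
    """
    h, w = len(img), len(img[0])
    dx = [[0.0] * w for _ in range(h)]
    dy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            dx[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            dy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0
    return [dx, dy]


def squash(channel, alpha=0.25):
    """Apply the non-linear transformation tanh(alpha * t) to each response."""
    return [[math.tanh(alpha * v) for v in row] for row in channel]


def local_energy(channel, radius=1):
    """Mean of squared responses in a (2*radius+1)^2 window around each pixel."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [channel[j][i] ** 2
                    for j in range(max(0, y - radius), min(h, y + radius + 1))
                    for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out


def feature_vectors(img, alpha=0.25, radius=1):
    """One feature vector (x-energy, y-energy) per pixel."""
    channels = [local_energy(squash(c, alpha), radius)
                for c in derivative_responses(img)]
    h, w = len(img), len(img[0])
    return [[tuple(c[y][x] for c in channels) for x in range(w)]
            for y in range(h)]
```

On an image of vertical stripes the x-energy is high and the y-energy is zero, while a flat region gives zero in both channels; a classifier would separate pixels on the basis of such vectors.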

Figure  2(a)  shows  a  portion  of  an  original  input  im¬ 
age  with  a  variety  of  textual  information  to  be  extracted. 
There  is  text  on  a  clean  dark  background,  text  printed 
on  Stouffer  boxes,  Stouffer’s  trademarks  (in  script),  and 
a picture of the food. Figure 2(b) shows the final segmented
text regions.

Figure 2: Results of Texture Segmentation and Chip Generation. (a) Portion of an input image; (b) The final segmented text
regions; (c) Extracted strokes; (d) Text chips mapped on the input image.

Figure 3: The scale problem and its solution. (a) Chips generated for the input image at full resolution; (b) half resolution; (c) quarter
resolution; (d) Chips generated at all three levels mapped onto the input image. Scale-redundant chips are removed.


2.3  The  Chip  Generation  Phase 

In  practice,  text  may  occur  in  images  with  complex  back¬ 
grounds  and  texture  patterns,  such  as  foliage,  windows, 
grass  etc.  Thus,  some  non-text  patterns  may  pass  the  fil¬ 
ters  and  initially  be  misclassified  as  text  (Figure  2(b)). 
Furthermore,  segmentation  accuracy  at  texture  bound¬ 
aries  is  a  well-known  and  difficult  problem  in  texture 
segmentation.  Consequently,  it  is  often  the  case  that  text 
regions  are  connected  to  other  regions  which  do  not  cor¬ 
respond  to  text,  or  one  text  string  might  be  connected  to 
another  text  string  of  a  different  size  or  intensity.  This 
might  cause  problems  for  later  processing.  For  example, 
if  two  text  strings  with  significantly  different  intensity 
levels  are  joined  into  one  region,  one  intensity  threshold 
might  not  separate  both  text  strings  from  the  background. 

Therefore,  heuristics  need  to  be  employed  to  refine 
the  segmentation  result.  Since  the  segmentation  process 
usually  finds  text  regions  while  excluding  most  of  those 
that  are  non-text,  these  regions  can  be  used  to  direct  fur¬ 
ther  processing  (focus  of  attention).  Furthermore,  since 
text  is  intended  to  be  readable,  there  is  usually  a  sig¬ 
nificant  contrast  between  it  and  the  background.  Thus 
contrast can be utilized in finding text. Also, it is usually
the  case  that  characters  in  the  same  word/phrase/sentence 
are of the same font and have similar heights and inter-character
spaces. Finally, it is obvious that characters in a
horizontal  text  string  are  horizontally  aligned.  Therefore, 
all  the  heuristics  above  are  incorporated  in  the  Chip  Gen¬ 
eration  phase  in  a  bottom-up  fashion:  significant  edges 
form  strokes  (Figure  2(c));  strokes  from  the  segmented 
regions  are  aggregated  to  form  chips  corresponding  to 
text  strings.  The  rectangular  bounding  boxes  of  the  chips 
are  used  to  indicate  where  the  hypothesized  (detected) 
text  strings  are  (Figure  2(d)).  These  steps  are  described 
in  detail  in  [34,  35], 

2.4  A  Solution  to  the  Scale  Problem 

The  frequency  channels  used  in  the  segmentation  pro¬ 
cess  work  well  to  cover  text  over  a  certain  range  of  font 
sizes.  Text  from  larger  font  sizes  is  either  missed  or  frag¬ 
mented.  This  is  called  the  scale  problem.  Intuitively,  the 
larger  the  font  size  of  the  text,  the  lower  the  frequency  it 
possesses.  Thus,  when  the  text  font  size  gets  too  large, 
its  frequency  falls  outside  the  channels  selected  in  sec¬ 
tion  2.2. 

A  pyramid  approach  (Figure  1(a))  is  used  to  solve  the 
scale  problem:  a  pyramid  of  the  input  image  is  formed 
and  each  image  in  the  pyramid  is  processed  as  described 
in  the  previous  sections.  At  the  bottom  of  the  pyramid 
is  the  original  image;  the  image  at  each  level  (other  than 
the bottom) has half the resolution of the image
one level below. Text of smaller font sizes can be
detected using the images lower in the pyramid (Figure
3(a)), while text of large font sizes is found using images
higher in the pyramid (Figure 3(c)). The bounding boxes
of detected text regions at each level are mapped back to
the original input image and the redundant boxes are then
removed as shown in Figure 3(d). Details are presented
in [34, 35].

Figure 4: Binarization results before and after the Chip Refinement step. (a) Input image; (b) binarization result before refinement;
(c) after refinement.
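
The pyramid construction and the mapping of boxes back to the original resolution can be sketched as follows. This is a minimal Python sketch under stated assumptions: the 2x2 block averaging used to halve the resolution, the box format (x0, y0, x1, y1) and the overlap rule used to discard redundant boxes are all illustrative choices, since the actual fusion procedure is only described in [34, 35].

```python
def halve(img):
    """Average 2x2 blocks to produce an image at half the resolution."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]


def build_pyramid(img, levels=3):
    """Level 0 is the original image; each higher level has half the resolution."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(halve(pyramid[-1]))
    return pyramid


def to_original(box, level):
    """Map a box (x0, y0, x1, y1) found at a pyramid level to level-0 coordinates."""
    scale = 2 ** level
    return tuple(v * scale for v in box)


def overlap_fraction(a, b):
    """Fraction of the area of box a that is covered by box b."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    area = (a[2] - a[0]) * (a[3] - a[1])
    return (ix * iy) / area if area else 0.0


def fuse(boxes_per_level, min_overlap=0.8):
    """Map all boxes to level 0 and drop boxes mostly covered by a kept box.

    Assumption: larger boxes are preferred and min_overlap is a
    hypothetical redundancy threshold, not a value from the paper.
    """
    mapped = [to_original(b, lvl)
              for lvl, boxes in enumerate(boxes_per_level) for b in boxes]
    mapped.sort(key=lambda b: (b[2] - b[0]) * (b[3] - b[1]), reverse=True)
    kept = []
    for b in mapped:
        if all(overlap_fraction(b, k) < min_overlap for k in kept):
            kept.append(b)
    return kept
```

Here `boxes_per_level[k]` would hold the chips detected when the Texture Segmentation and Chip Generation steps are run on `build_pyramid(img)[k]`.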

2.5  Text  on  Complex  Backgrounds 

The  previous  sections  describe  a  system  which  detects 
text  in  images  and  puts  boxes  around  detected  text  strings 
in  the  input  image.  Since  text  may  be  printed  against 
complex  image  backgrounds,  which  current  OCR  sys¬ 
tems  cannot  handle  well,  it  is  desirable  to  have  the  back¬ 
grounds  removed  first.  In  addition,  OCR  systems  require 
that  the  text  must  be  binarized  before  actual  recognition 
starts.  In  this  system,  the  background  removal  and  text 
binarization  is  done  by  applying  an  algorithm  to  the  text 
boxes  individually  instead  of  trying  to  binarize  the  input 
image  as  a  whole.  This  allows  the  process  to  adapt  to  the 
individual  context  of  each  text  string.  The  details  of  the 
algorithm  are  in  [34,  35]. 
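
To illustrate why binarizing each text box separately helps, the following Python sketch thresholds a box at the midpoint of its own intensity range. The threshold rule and the assumption that text is darker than its background are hypothetical simplifications chosen for the example; they are not the clean-up algorithm of [34, 35].

```python
def binarize_box(img, box):
    """Binarize one text box using a threshold derived from that box alone.

    Assumptions: box is (x0, y0, x1, y1) with exclusive upper bounds,
    text is darker than the background, and the midpoint of the
    intensity range inside the box is an adequate threshold.
    Returns a 0/1 image of the box, with 1 marking text pixels.
    """
    x0, y0, x1, y1 = box
    pixels = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    threshold = (min(pixels) + max(pixels)) / 2.0
    return [[1 if img[y][x] < threshold else 0 for x in range(x0, x1)]
            for y in range(y0, y1)]


# Two text strings with very different intensity levels:
# left half: text 150 on background 200; right half: text 20 on background 100.
row = [200, 150, 200, 200, 100, 20, 100, 100]
image = [row[:], row[:]]

left = binarize_box(image, (0, 0, 4, 2))    # threshold adapts to the left box
right = binarize_box(image, (4, 0, 8, 2))   # threshold adapts to the right box
whole = binarize_box(image, (0, 0, 8, 2))   # a single threshold for both strings
```

With one box per string, each text pixel is separated from its own background; with a single threshold over both strings, the lighter string is lost and the darker background is marked as text.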

2.6  The  Text  Refinement 

Sometimes  non-text  items  are  identified  as  text  as  well. 
In  addition,  the  bounding  boxes  of  the  chips  sometimes 
do  not  tightly  surround  the  text  strings.  The  consequence 
of  these  problems  is  that  non-text  items  may  occur  in 
the  binarized  image,  produced  by  mapping  the  extracted 
items  onto  the  original  page.  An  example  is  shown  in 
Figure  4(a,b).  These  non-text  items  are  not  desirable. 

However, by treating the extracted items as strokes,
the Chip Refinement module, which is essentially similar
to the Chip Generation module but with stronger
constraints, can be applied here to eliminate the non-text
items and hence form tighter text bounding boxes.
This  can  be  achieved  because  (1)  the  clean-up  proce¬ 
dure  is  able  to  extract  most  characters  without  attach¬ 
ing  to  nearby  characters  and  non-text  items  (Figure  4(b)), 
and  (2)  most  of  the  strokes  at  this  stage  are  composed  of 
complete  or  almost  complete  characters,  as  opposed  to 
the  vertical  connected  edges  of  the  characters  in  the  ini¬ 
tial  processing.  Thus,  it  can  be  expected  that  the  correct 
text  strokes  comply  more  consistently  with  the  heuristics 
used  in  the  early  Chip  Generation  phase.  The  significant 
improvement is clearly shown in Figure 4(c).

2.7  Experiments 

The  system  has  been  tested  over  48  images  from  a  wide 
variety  of  sources:  digitized  video  frames,  photographs, 
newspapers,  advertisements  in  magazines  or  sales  flyers, 
and  personal  checks.  Some  of  the  images  have  regular 
page  layouts,  others  do  not.  It  should  be  pointed  out  that 
all the system parameters remain the same throughout
the entire set of test images, showing the robustness of
the system.

Characters  and  words  (as  perceived  by  one  of  the  au¬ 
thors)  were  counted  in  each  image  as  ground  truth.  The 
total  numbers  over  the  whole  test  set  are  shown  in  the 
“Total Perceived” column in Table 1. The detected characters
and words are those which are completely enclosed
by the boxes produced after the Chip Scale Fusion
step. The total numbers of detected characters and
words  over  the  entire  test  set  are  shown  in  the  “Total  De¬ 
tected”  column.  Characters  and  words  clearly  readable 
by  a  person  after  the  Chip  Refinement  and  Text  Clean-up 
steps  (final  extracted  text)  are  also  counted  for  each  im¬ 
age,  with  the  total  numbers  shown  in  the  “Total  Clean¬ 
up”  column.  The  column  “Total  OCRable”  shows  the 
total  numbers  of  cleaned-up  characters  and  words  that 
appear  to  be  of  OCR  recognizable  fonts  in  35  of  the  bi¬ 
narized  images.  Note  that  only  the  text  which  is  horizon¬ 
tally  aligned  is  counted  (skew  angle  of  the  text  string  is 
less than roughly 30 degrees) (see footnote 1). The “Total OCRed” column
shows the numbers of characters and words from the
“Total  OCRable”  sets  correctly  recognized  by  Caere’s 
commercial  WordScan  OCR  engine. 

Figure  5(a)  is  a  portion  of  an  original  input  image 
which  has  no  structured  layout.  The  final  binarization  re¬ 
sult  is  shown  in  (b)  and  the  corresponding  OCR  output  is 
shown  in  (c).  Notice  that  most  of  the  text  is  detected,  and 
most of the text of machine-printed fonts is correctly
recognized  by  the  OCR  engine.  It  should  be  pointed  out 
that  the  cleaned-up  output  looks  fine  to  a  person  in  the 
places  where  the  OCR  errors  occurred. 

3  Word  Spotting:  Indexing  Handwritten 
Archival  Manuscripts 

There are many historical manuscripts written in a single
hand which it would be useful to index. Examples
include the W. E. B. Du Bois collection at the University
of Massachusetts, Margaret Sanger’s collected
works at Smith College and the early Presidential libraries
at the Library of Congress. Such manuscripts
are  valuable  resources  for  scholars  as  well  as  others  who 
wish  to  consult  the  original  manuscripts  and  consider¬ 
able  effort  has  gone  into  manually  producing  indices 
for  them.  For  example,  a  substantial  collection  of  Mar¬ 
garet  Sanger’s  work  has  been  recently  put  on  microfilm 
(see  http://MEP.cla.sc.edu/Sanger/SangBase.HTM)  with 
an  item  by  item  index.  These  indices  were  created  manu¬ 
ally.  The  indexing  scheme  described  here  will  help  in  the 
automatic  creation  and  production  of  indices  and  concor¬ 
dances  for  such  archives. 

Footnote 1: Here, the focus is on finding horizontal, linear text strings only.
The issue of finding text strings of any orientation will be addressed in
future work.

Table 1: Summary of the system’s performance. 48 images were used for detection and clean-up. Out of these, 35 binarized
images were used for the OCR process.

Figure 5: Example 1. (a) Original image (ads 11); (b) Extracted text; (c) The OCR result using Caere's WordScan Plus 4.0 on (b).

One solution is to use Optical Character Recognition
(OCR) to convert scanned paper documents into ASCII.
Existing OCR technology works well with standard machine
printed fonts against clean backgrounds. It works
poorly if the originals are of poor quality or if the text
is handwritten. Since OCR does not work well on
handwriting, an alternative
scheme based on matching the images of the words was
proposed by us in [18, 17, 15] for indexing such texts.
Here a brief summary of the work is presented.

Since  the  document  is  written  by  a  single  person,  the 
assumption  is  that  the  variation  in  the  word  images  will 
be  small.  The  proposed  solution  will  first  segment  the 
page  into  words  and  then  match  the  actual  word  images 
against  each  other  to  create  equivalence  classes.  Each 
equivalence  class  will  consist  of  multiple  instances  of  the 
same  word.  Each  word  will  have  a  link  to  the  page  it 
came  from.  The  number  of  words  in  each  equivalence 
class  will  be  tabulated.  Those  classes  with  the  largest 
numbers  of  words  will  probably  be  stopwords,  i.e.  con¬ 
junctions  such  as  “and”  or  articles  such  as  “the”.  Classes 
containing  stopwords  are  eliminated  (since  they  are  not 
very  useful  for  indexing).  A  list  is  made  of  the  remain¬ 
ing  classes.  This  list  is  ordered  according  to  the  num¬ 
ber  of  words  contained  in  each  of  the  classes.  The  user 
provides  ASCII  equivalents  for  a  representative  word  in 
each  of  the  top  m  (say  m  =  2000)  classes.  The  words  in 
these  classes  can  now  be  indexed.  This  technique  will  be 
called  “word  spotting”  as  it  is  analogous  to  “word  spot¬ 
ting”  in  speech  processing  [9]. 


The  proposed  solution  completely  avoids  machine 
recognition  of  handwritten  words  as  this  is  a  difficult  task 
[20]. Robustness is achieved compared to OCR systems
for  two  reasons: 

1.  Matching  is  based  on  entire  words.  This  is  in  con¬ 
trast  to  conventional  OCR  systems  which  essen¬ 
tially  recognize  characters  rather  than  words. 

2.  Recognition  is  avoided.  Instead  a  human  is  placed 
in  the  loop  when  ASCII  equivalents  of  the  words 
must  be  provided. 

Some  of  the  matching  aspects  of  the  problem  are  dis¬ 
cussed  here  (for  a  discussion  of  page  segmentation  into 
words,  see  [18]).  The  matching  phase  of  the  problem  is 
expected  to  be  the  most  difficult  part  of  the  problem.  This 
is  because  unlike  machine  fonts,  there  is  some  variation 
in  even  a  single  person’s  handwriting.  This  variation  is 
difficult to model. Figure 6 shows two examples of the
word  “Lloyd”  written  by  the  same  person.  The  last  image 
is  produced  by  XOR’ing  these  two  images.  The  white  ar¬ 
eas  in  the  XOR  image  indicate  where  the  two  versions  of 
“Lloyd”  differ.  This  result  is  not  unusual.  In  fact,  the 
differences  are  sometimes  even  larger. 

The performance of two different matching techniques
is discussed here. The first, based on Euclidean distance
mapping [2], assumes that the deformation between
words can be modelled by a translation (shift).
The second, based on an algorithm by Scott and
Longuet-Higgins [28], models the transformation between words
using an affine transform.

Figure 6: Two examples of the word “Lloyd” and the
XOR image

3.1  Prior  Work 

The  traditional  approach  to  indexing  documents  involves 
first  converting  them  to  ASCII  and  then  using  a  text 
based  retrieval  engine  [30].  Scanned  documents  printed 
in  standard  machine  fonts  against  clean  backgrounds  can 
be converted into ASCII using an OCR [1]. However,
handwriting  is  much  more  difficult  for  OCRs  to  handle 
because  of  the  wide  variability  present  in  handwriting 
(not  only  is  there  variability  between  writers,  but  a  given 
person’s  writing  also  varies). 

Image  matching  of  words  has  been  used  to  recognize 
words  in  documents  which  use  machine  fonts  [5,  10]. 
Recognition  rates  are  much  higher  than  when  the  OCR 
is  used  directly  [10].  Machine  fonts  are  simpler  to 
match  than  handwritten  fonts  since  the  variation  is  much 
smaller;  multiple  instances  of  a  given  word  printed  in  the 
same  font  are  identical  except  for  noise.  In  handwrit¬ 
ing,  however,  multiple  instances  of  the  same  word  on  the 
same  page  by  the  same  writer  show  variations.  The  first 
two pictures in Figure 6 are two instances of the same word from the
same  document,  written  by  the  same  writer.  It  may  thus 
be  necessary  to  account  for  these  variations. 

3.2  Outline  of  Algorithm 

1.  A  scanned  greylevel  image  of  the  document  is  ob¬ 
tained. 

2. The image is first reduced by half by Gaussian filtering
and subsampling.

3.  The  reduced  image  is  then  binarized  by  threshold¬ 
ing  the  image. 

4. The binary image is now segmented into words; this
is  done  by  a  process  of  smoothing  and  thresholding 
(see  [18]). 

5.  A  given  word  image  (i.e.  the  image  of  a  word)  is 
used  as  a  template,  and  matched  against  all  the  other 
word  images.  This  is  repeated  for  every  word  in 
the  document.  The  matching  is  done  in  two  phases. 
First,  the  number  of  words  to  be  matched  is  pruned 
using the areas and aspect ratios of the word images:
the word to be matched cannot have an area
or aspect ratio which is too different from the template.
Next, the actual matching is done by using
a  matching  algorithm.  Two  different  matching  al¬ 
gorithms  are  tried  here.  One  of  them  only  accounts 
for  translation  shifts,  while  the  other  accounts  for 
affine  matches.  The  matching  divides  the  word  im¬ 
ages  into  equivalence  classes  -  each  class  presum¬ 
ably  containing  other  instances  of  the  same  word. 

6.  Indexing  is  done  as  follows.  For  each  equivalence 
class,  the  number  of  elements  in  it  is  counted.  The 
top  n  equivalence  classes  are  then  determined  from 
this  list.  The  equivalence  classes  with  the  highest 
number of words (elements) are likely to be stopwords
(i.e. conjunctions like ‘and’, articles like
‘the’, and prepositions like ‘of’) and are therefore
eliminated  from  further  consideration.  Let  us  as¬ 
sume  that  of  the  top  n,  m  are  left  after  the  stopwords 
have  been  eliminated.  The  user  then  displays  one 
member  of  each  of  these  m  equivalence  classes  and 
assigns  their  ASCII  interpretation.  These  m  words 
can  now  be  indexed  anywhere  they  appear  in  the 
document. 
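
The grouping and indexing steps (5 and 6) can be sketched in Python as follows. This is only an illustrative sketch: the word records, the `is_match` predicate (standing in for pruning plus image matching) and the `label_class` callback (standing in for the user, who views one word image per class and either types its ASCII form or flags the class as a stopword by returning None) are hypothetical names introduced for the example.

```python
from collections import defaultdict


def equivalence_classes(words, is_match):
    """Group word images into classes using union-find over pairwise matches.

    Returns a list of classes (lists of word indices), largest first.
    """
    parent = list(range(len(words)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if is_match(words[i], words[j]):
                parent[find(i)] = find(j)

    classes = defaultdict(list)
    for i in range(len(words)):
        classes[find(i)].append(i)
    return sorted(classes.values(), key=len, reverse=True)


def build_index(words, classes, label_class, n_top):
    """Index the top n_top classes that the user does not flag as stopwords.

    Each word record is assumed to carry the page it came from.
    """
    index = defaultdict(list)
    for members in classes[:n_top]:
        label = label_class(words[members[0]])  # one representative per class
        if label is None:                        # flagged as a stopword
            continue
        for i in members:
            index[label].append(words[i]["page"])
    return dict(index)


# Toy data: the "glyph" field stands in for the word image itself.
words = [
    {"page": 1, "glyph": "the"}, {"page": 1, "glyph": "Lloyd"},
    {"page": 2, "glyph": "the"}, {"page": 2, "glyph": "Lloyd"},
    {"page": 3, "glyph": "the"}, {"page": 3, "glyph": "rare"},
]
classes = equivalence_classes(words, lambda a, b: a["glyph"] == b["glyph"])
index = build_index(
    words, classes,
    label_class=lambda w: None if w["glyph"] == "the" else w["glyph"],
    n_top=2,
)
```

In this toy run the largest class is the stopword "the", which is dropped, and the second class is indexed under the label supplied for its representative.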

We  will  now  discuss  the  matching  techniques  in  detail. 

3.3  Determination  of  Equivalence  Classes 

The  list  of  words  to  be  matched  is  first  pruned  using  the 
areas  and  aspect  ratios  of  the  word  images.  The  pruned 
list  of  words  is  then  matched  using  a  matching  algorithm. 

3.4  Pruning 

It is assumed that

$$\frac{1}{\alpha} \le \frac{A_{word}}{A_{template}} \le \alpha \tag{1}$$

where $A_{template}$ is the area of the template and $A_{word}$
is the area of the word to be matched. Typical values of
$\alpha$ used in the experiments range between 1.2 and 1.3. A
similar filtering step is performed using aspect ratios (i.e.
the width/height ratio). It is assumed that

$$\frac{1}{\beta} \le \frac{Aspect_{word}}{Aspect_{template}} \le \beta \tag{2}$$

The values of $\beta$ used in the experiments range between 1.4
and 1.7. In both the above equations, the exact factors are
not important, but the bounds should not be so tight that valid
words are omitted, nor so loose that too many words
are passed on to the matching phase. The pruning values
may be automatically determined by running statistics on
samples of the document [15].
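
The area and aspect-ratio tests can be written as a small Python predicate. In this sketch the area of a word image is taken to be the area of its bounding box, which is an assumption made for the example, and the default thresholds are simply picked from the ranges quoted above.

```python
def passes_pruning(template_size, word_size, alpha=1.25, beta=1.5):
    """Return True if a word image survives pruning against a template.

    template_size and word_size are (width, height) of the bounding boxes.
    Assumption: "area" means bounding-box area (width * height).
    """
    tw, th = template_size
    ww, wh = word_size
    area_ratio = (ww * wh) / (tw * th)
    aspect_ratio = (ww / wh) / (tw / th)
    return (1.0 / alpha <= area_ratio <= alpha
            and 1.0 / beta <= aspect_ratio <= beta)


template = (100, 40)
similar = passes_pruning(template, (105, 42))      # close in area and shape
too_long = passes_pruning(template, (200, 40))     # area ratio is 2
wrong_shape = passes_pruning(template, (80, 50))   # same area, aspect ratio 0.64
```

Only word images for which the predicate holds would be handed to the (more expensive) matching step.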

3.5  Matching 

The  template  is  then  matched  against  the  image  of  each 
word  in  the  pruned  list.  The  matching  function  must  sat¬ 
isfy  two  criteria: 


1.  It  must  produce  a  low  match  error  for  words  which 
are  similar  to  the  template. 

2.  It  must  produce  a  high  match  error  for  words  which 
are  dissimilar. 

Two matching algorithms have been tried. The first
algorithm, Euclidean Distance Mapping (EDM), assumes
that no distortions have occurred except for relative
translation and is fast. This algorithm usually ranks
the matched words in the correct order (i.e. valid words
first, followed by invalid words) when the variation in
words is not too large. Although it returns the lowest
errors for words which are similar to the template,
it also returns low errors for words which are dissimilar
to the template. The second algorithm [28], referred to as
SLH here, assumes an affine transformation between the
words. It thus compensates for some of the variations in
the words. This algorithm not only ranks the words in the
correct order for all examples tried so far, it also seems
to be able to better discriminate between valid words and
invalid words. As currently implemented the SLH algorithm
is much slower than the EDM algorithm (we expect
to be able to speed it up).

3.6  Using  Euclidean  Distance  Mapping  for 
Matching 

This  approach  is  similar  to  that  used  by  [6]  to  match  ma¬ 
chine  generated  fonts.  A  brief  description  of  the  method 
follows  (more  details  are  available  from  [18]). 

Consider two images to be matched. There are four
steps in the matching:

1.  First  the  images  are  roughly  aligned.  In  the  verti¬ 
cal  direction,  this  is  done  by  aligning  the  baselines 
of  the  two  images.  In  the  horizontal  direction,  the 
images  are  aligned  by  making  their  left  hand  sides 
coincide. 

The  alignment  is,  therefore,  expected  to  be  accurate 
in  the  vertical  direction  and  not  as  good  in  the  hori¬ 
zontal  direction.  This  is  borne  out  in  practice. 

2.  Next  the  XOR  image  is  computed.  This  is  done  by 
XOR’ing  corresponding  pixels  (see  Figure  6). 

3. A Euclidean distance mapping [2] is computed
from the XOR image by assigning to each white
pixel in the image its minimum distance to a black
pixel. Thus a white pixel inside a blob is assigned
a larger distance than an isolated white pixel. An
error measure $E_{EDM}$ can now be computed by
adding up the distance measures for each pixel.

4. Although the approximate translation has been
computed using step 1, this may not be accurate and
may need to be fine-tuned. Thus steps (2) and (3)
are repeated while sampling the translation space in
both x and y. A minimum error measure $E_{EDMmin}$
is computed over all the translation samples.
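
Steps 2 to 4 can be sketched in Python as follows. The sketch assumes that the two binary word images have already been roughly aligned as in step 1, that ink pixels are stored as 1, and that everything outside the XOR blobs counts as black; the brute-force distance computation and the size of the translation search window are illustrative choices, not the implementation of [18].

```python
import math


def ink(img):
    """Set of (x, y) coordinates of the ink pixels (value 1) in a binary image."""
    return {(x, y) for y, row in enumerate(img) for x, v in enumerate(row) if v}


def edm_error(a, b, shift=(0, 0)):
    """Sum, over white XOR pixels, of the distance to the nearest black pixel.

    a and b are sets of ink coordinates; b is translated by shift first.
    """
    dx, dy = shift
    diff = a ^ {(x + dx, y + dy) for (x, y) in b}  # white pixels of the XOR image
    if not diff:
        return 0.0
    xs = [x for x, _ in diff]
    ys = [y for _, y in diff]
    # The nearest black pixel always lies inside the bounding box of the
    # white pixels grown by one pixel, so only that region is searched.
    black = [(x, y)
             for x in range(min(xs) - 1, max(xs) + 2)
             for y in range(min(ys) - 1, max(ys) + 2)
             if (x, y) not in diff]
    return sum(min(math.hypot(x - bx, y - by) for bx, by in black)
               for x, y in diff)


def edm_min_error(img_a, img_b, search=2):
    """Minimum EDM error over integer translations within +/- search pixels."""
    a, b = ink(img_a), ink(img_b)
    return min(edm_error(a, b, (dx, dy))
               for dx in range(-search, search + 1)
               for dy in range(-search, search + 1))
```

A 3 by 3 blob of differing pixels costs more than nine isolated ones because its centre pixel is two pixels away from the nearest black pixel, which is the behaviour described in step 3.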


3.7  SLH  Algorithm  for  Matching 

The  EDM  algorithm  does  not  discriminate  well  between 
good  and  bad  matches.  In  addition,  it  fails  when  there  is 
significant  distortion  in  the  words.  This  happens  with  the 
writing  of  Erasmus  Hudson  (Figure  7).  Thus  a  match¬ 
ing  algorithm  which  models  some  of  the  variation  is 
needed. A second matching algorithm (SLH), which
models the distortion as an affine transformation, was
therefore tried (note that the real variation
is probably much more complex). An affine transform
is a linear transformation followed by a translation
between coordinate systems. In two dimensions, it is described by

$$\mathbf{r}' = A\mathbf{r} + \mathbf{t} \tag{3}$$

where  t  is  a  2-D  vector  describing  the  translation,  A  is 
a  2  by  2  matrix  which  captures  the  deformation,  r'  and 
r  are  the  coordinates  of  corresponding  points  in  the  two 
images  between  which  the  affine  transformation  must  be 
recovered.  An  affine  transform  allows  for  the  following 
deformations  -  scaling  in  both  directions,  shear  in  both 
directions  and  rotation. 

The  algorithm  chosen  here  is  one  proposed  by  Scott 
and  Longuet-Higgins  [28]  (see  [16]).  The  algorithm  re¬ 
covers  the  correspondence  between  two  sets  of  points  I 
and  J  under  an  affine  transform. 

The sets I and J are created as follows. Every white
pixel in the first image is a member of the set I. Similarly,
every white pixel in the second image is a member of
set J. First, the centroid of each point set is computed
and the origin of each coordinate system is set at the
corresponding centroid. The SLH algorithm is then used
to compute the correspondence between the point sets.
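The construction of the point sets can be sketched as below. This covers only the preparation step (collecting white pixels and moving the origin to the centroid); the SLH correspondence computation itself is described in [28] and is not reproduced here. The function names are illustrative.

```python
def point_set(image):
    """(x, y) coordinates of every white (1) pixel in a binary image."""
    return [(x, y) for y, row in enumerate(image)
            for x, v in enumerate(row) if v == 1]

def center(points):
    """Shift points so that their centroid becomes the origin.

    Returns the centred points and the original centroid.
    """
    if not points:
        raise ValueError("empty point set")
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return [(x - cx, y - cy) for x, y in points], (cx, cy)
```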

Given  the  (above)  correspondence  between  point  sets 
I  and  J,  the  affine  transform  A,  t  can  be  determined  by 
minimizing  the  following  least  mean  squares  criterion: 

$E_{SLH} = \sum_i (I_i - A J_i - t)^2$  (4)

where $I_i$, $J_i$ are the (x,y) coordinates of points $I_i$ and $J_i$
respectively.

The recovered values of A and t are then plugged back
into the above equation to compute the error $E_{SLH}$.
The error $E_{SLH}$ is an estimate of how dissimilar two
words are and the words can, therefore, be ranked
according to it.

It will be assumed that the variation for valid words
is not too large. This implies that if $A_{11}$ and $A_{22}$ are
considerably different from 1, the word is probably not a
valid match.
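Given corresponding points, the least-squares estimate of A and t and the resulting error can be sketched as follows. This is an illustrative solution of the normal equations in plain Python, assuming the model $I_i \approx A J_i + t$; the function names are hypothetical, and the default bound of 0.8 on the diagonal entries of A is the value quoted later in the experiments.

```python
def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def _solve3(m, b):
    """Solve the 3x3 system m p = b by Cramer's rule."""
    d = _det3(m)
    if abs(d) < 1e-12:
        raise ValueError("degenerate point set (e.g. collinear points)")
    solution = []
    for col in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = b[r]
        solution.append(_det3(mc) / d)
    return solution

def fit_affine(I, J):
    """Least-squares A (2x2) and t (length 2) minimising sum |I_i - A J_i - t|^2."""
    n = len(J)
    sxx = sum(x * x for x, y in J)
    sxy = sum(x * y for x, y in J)
    syy = sum(y * y for x, y in J)
    sx = sum(x for x, y in J)
    sy = sum(y for x, y in J)
    m = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rows = []
    for k in (0, 1):  # fit the x' and y' equations separately
        b = [sum(J[i][0] * I[i][k] for i in range(n)),
             sum(J[i][1] * I[i][k] for i in range(n)),
             sum(I[i][k] for i in range(n))]
        rows.append(_solve3(m, b))
    A = [[rows[0][0], rows[0][1]], [rows[1][0], rows[1][1]]]
    t = [rows[0][2], rows[1][2]]
    return A, t

def slh_error(I, J, A, t):
    """Sum of squared residuals of the fitted transform (equation 4)."""
    total = 0.0
    for (ix, iy), (jx, jy) in zip(I, J):
        rx = ix - (A[0][0] * jx + A[0][1] * jy + t[0])
        ry = iy - (A[1][0] * jx + A[1][1] * jy + t[1])
        total += rx * rx + ry * ry
    return total

def plausible_match(A, bound=0.8):
    """Reject transforms whose diagonal scale factors stray far from 1."""
    return all(bound <= A[k][k] <= 1.0 / bound for k in (0, 1))
```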

Note:  The  SLH  algorithm  assumes  that  pruning  on  the 
basis  of  the  area  and  aspect  ratio  thresholds  is  performed. 

3.8  Experiments 

The two matching techniques were tested on
two handwritten pages, each written by a different
writer. The first page can be obtained from
the DIMUND document server on the internet
(http://documents.cfar.umd.edu/resources/database/handwriting.database.html).
This page will be referred
to as the Senior document. The handwriting on this
page is fairly neat (see [18] for a picture). The second
page is from an actual archival collection, the Hudson
collection from the library of the University of
Massachusetts (part of the page is shown in Figure (7)). This
page is part of a letter written by James S. Gibbons to
Erasmus Darwin Hudson. The handwriting on this page
is difficult to read and the indexing technique helped in
deciphering some of the words.

Figure 7: Part of a page from the collected papers of the Hudson family

The  experiments  will  show  examples  of  how  the 
matching  techniques  work  on  a  few  words.  For  more  ex¬ 
amples  of  the  EDM  technique  see  [18].  For  more  exam¬ 
ples  using  the  SLH  technique  and  comparisons  with  the 
EDM  technique  see  [16].  In  general,  the  EDM  method 
ranks  most  words  in  the  Senior  document  correctly  but 
ranks  some  words  in  the  Hudson  document  incorrectly. 
The  SLH  technique  performs  well  on  both  documents. 

Both pages were segmented into words (see [18] for
details). The algorithm was then run on the segmented
words. In the following figures, the first word shown
is the template. After the template, the other words are
ranked according to the match error. Note that only the
first few results of the matching are shown although the
template has been matched with every word on the page.
The area threshold $\alpha$ was chosen to be 1.2 and the aspect
ratio threshold $\beta$ was chosen as 1.4. The translation
values were sampled to within ±4 pixels in the x direction
and ±1 pixel in the y direction. Experimentally, this gave
the best results.
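The pruning and ranking procedure can be sketched as below. The pruning rule itself is defined earlier in the paper and is not restated in this section, so the sketch assumes one plausible form: a candidate is kept when its area ratio to the template lies in [1/α, α] and its aspect-ratio ratio lies in [1/β, β]. The dictionary keys, function names and the `match_error` callback are hypothetical.

```python
def passes_pruning(template, candidate, alpha=1.2, beta=1.4):
    """Assumed pruning test on area and aspect ratio (see lead-in)."""
    area_ratio = candidate["area"] / template["area"]
    aspect_ratio = candidate["aspect"] / template["aspect"]
    return (1.0 / alpha <= area_ratio <= alpha
            and 1.0 / beta <= aspect_ratio <= beta)

def rank_matches(template, candidates, match_error, alpha=1.2, beta=1.4):
    """Drop pruned candidates, then sort the rest by increasing match error."""
    kept = [c for c in candidates if passes_pruning(template, c, alpha, beta)]
    return sorted(kept, key=lambda c: match_error(template, c))
```

Here `match_error` stands for either matching technique, for example the minimum EDM error or the SLH error.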

3.9 Results using Euclidean Distance Mapping

The  Euclidean  Distance  Mapping  algorithm  works  rea¬ 
sonably  well  on  the  Senior  document.  An  example  is 
shown  below. 

In  Figure  (8),  the  template  is  the  word  “Lloyd”.  The 
figure  shows  that  the  four  other  instances  of  “Lloyd” 
present  in  the  document  are  ranked  before  any  of  the 
other words. As Table (2) shows, the match errors for
other instances of “Lloyd” are less than those for any other
word. In the table, the first column is the Token number
(this  is  needed  for  identification  purposes),  the  second 
column  is  a  transcription  of  the  word,  the  third  column 
shows  the  area  in  pixels,  the  fourth  gives  the  match  error 
and  the  last  two  columns  specify  the  translation  in  the  x 
and  y  directions  respectively.  Note  the  significant  change 
in  area  of  the  words. 

Figure 8: Ranked matches for template “Lloyd” using
the EDM algorithm (the rankings are ordered from left
to right and from top to bottom).

The performance on other words in the Senior document
is comparable (for other examples see [18]). This
is because the page is written fairly neatly. The
performance of the method is expected to correlate with the
quality of the handwriting. This was verified by running
experiments on a page from the Hudson collection
(Figure 7). The handwriting in the Hudson collection is
difficult to read even for humans looking at grey-level images
at 300 dpi. The writing shows wide variations in size; for
example, the area of the word “to” varies by as much as
100%. However, so large a variation is not expected, and
is not seen, when the words are larger. Since
humans have difficulty reading this material, we do not
expect that the method will perform very well on this
document.

The Euclidean Distance Mapping technique fails for
the template “Standard” in the Hudson document (see
Figure (9)). The failure occurs because the two
instances of “Standard” are written differently. The
template “Standard” has a gap between the “t” and the “a”.
This gap is not present in the second example of
“Standard” (this is more clearly visible in Figure (10)). A
technique to model some distortions is, therefore, necessary.

Figure 9: Rankings for template “Standard” using the
EDM algorithm (the rankings are ordered from left to
right and from top to bottom).

3.10 Experiments Using the SLH Algorithm

The SLH algorithm handles affine distortions and is,
therefore, more powerful than the EDM algorithm. Since
the current version of the SLH algorithm is slow, the
initial matches were pruned using the EDM algorithm and
then the SLH algorithm was run on the pruned subset.

Table 2: Rankings and Match Errors for template “Lloyd”.

Table 3: Rankings and Match Errors for template “Lloyd” Using SLH Algorithm.

Experiments were performed using both the Senior
document and the Hudson document. A few examples
are shown here (for more details see [16]). For the Senior
document the same pruning ratios were chosen as
before. To account for the large variations in the Hudson
papers, the area threshold $\alpha$ was fixed at 1.3 and the
aspect ratio threshold $\beta$ at 1.7. The value of the SLH
parameter $\sigma$ depends on the expected translation. Since
it is small, $\sigma = 2.0$. A lower value of $\sigma = 1.5$ yielded
poorer results.

The matches for the template “Lloyd” are shown in
Table (3). The successive columns of the table tabulate the
Token Number, the transcription of the word, the area of
the word image, the number of corresponding points
recovered by the SLH algorithm, the match error $E_{SLH}$
using the SLH algorithm and the affine transform. The
entries are ranked according to the match error $E_{SLH}$. If
either $A_{11}$ or $A_{22}$ is less than 0.8 or greater than 1/0.8,
that word is eliminated from the rankings. A comparison
with Table (2) shows that the rankings change. This is
not only true of the invalid words (for example, the sixth
entry in Table (2) is “Maybe” while the sixth entry in
Table (3) is “lawyer”) but is also true of the instances of
“Lloyd”. Both tables rank instances of “Lloyd” ahead of
other words. The SLH technique also shows a much greater
discrimination in match error: the match error for “lawyer”
is almost double the match error for the fifth “Lloyd”.

The method was also run on the Hudson document
(Figure (7)) and it ranked most of the words correctly
on this document. As an example, we look at the word
“Standard”, which the EDM method did not rank
correctly. The SLH method produces the correct ranking in
spite of the significant distortions in the word (see Figure
(10)).

Figure 10: Rankings for template “Standard” for the
SLH algorithm (the rankings are ordered from left to
right and from top to bottom).

3.10.1 Recall-Precision Results

Indexing and retrieval techniques may be evaluated
using recall and precision. Recall is defined as the
“proportion of relevant documents actually retrieved” while
precision is defined as the “proportion of retrieved
documents that are relevant” [31]. Figure 11 shows the
recall-precision results for both algorithms on the Senior
document. The two EDM graphs are for two different
values of the area ratio (1.22 and 1.3). Notice that they
do not differ significantly, thus showing that the exact
value of the area ratio is not critical. The average
precision for the EDM and SLH algorithms on the Senior
document is 79.7% and 86.3% respectively. Note that
SLH performs significantly better than EDM. Similar
results are obtained with the Hudson document.

Figure 11: Recall-precision results for Senior document
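The recall and precision definitions quoted above can be made concrete with a short sketch. The ranked list is represented as a sequence of booleans (True where the retrieved word is a correct match). The paper does not say how its average precision figures were computed; the function below uses one common definition (the mean of the precision values at the rank of each relevant item), which is an assumption.

```python
def precision_recall_at(ranked_relevance, total_relevant, k):
    """Precision and recall after the first k retrieved items."""
    if k <= 0 or total_relevant <= 0:
        raise ValueError("k and total_relevant must be positive")
    hits = sum(1 for rel in ranked_relevance[:k] if rel)
    return hits / k, hits / total_relevant

def average_precision(ranked_relevance, total_relevant):
    """Mean of the precision values measured at each relevant item's rank.

    Relevant items that are never retrieved contribute zero.
    """
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    hits = 0
    precisions = []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / total_relevant
```

For a ranking such as [True, True, False, True, False] with three relevant words in total, precision and recall after three items are both 2/3.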


4  Image  Retrieval 

The indexing and retrieval of images using their content
is a poorly understood and difficult problem. A person
using an image retrieval system usually seeks to find
semantic information. For example, a person may be looking
for a picture of a leopard from a certain viewpoint.
Alternatively, the user may require a picture of Abraham
Lincoln from a particular viewpoint.

Retrieving  semantic  information  using  image  content 
is  difficult  to  do.  The  automatic  segmentation  of  an  im¬ 
age  into  objects  is  a  difficult  and  unsolved  problem  in 
computer  vision.  However,  many  image  attributes  like 
color,  texture,  shape  and  “appearance”  are  often  directly 
correlated  with  the  semantics  of  the  problem.  For  exam¬ 
ple,  logos  or  product  packages  (e.g.,  a  box  of  Tide)  have 
the  same  color  wherever  they  are  found.  The  coat  of  a 
leopard  has  a  unique  texture  while  Abraham  Lincoln’s 
appearance  is  uniquely  defined.  These  image  attributes 
can  often  be  used  to  index  and  retrieve  images. 

The Center has carried out pioneering research in this
area. The Center conducts research in both color based
image retrieval and appearance based image retrieval
(the methods applied to appearance based image retrieval
may also be directly applied to texture based image
retrieval). We will now discuss appearance based retrieval
(the reader is referred to [3] for a discussion of
color based retrieval).

4.1  Retrieval  by  Appearance 

Some  attempts  have  been  made  to  retrieve  objects  using 
their  shape  [4,  24].  For  example,  the  QBIC  system  [4], 
developed  by  IBM,  matches  binary  shapes.  It  requires 
that  the  database  be  segmented  into  objects.  Since  auto¬ 
matic  segmentation  is  an  unsolved  problem,  this  requires 
the  user  to  manually  outline  the  objects  in  the  database. 
Clearly  this  is  not  desirable  or  practical. 

Except  for  certain  special  domains,  all  methods  based 
on  shape  are  likely  to  have  the  same  problem.  An  ob¬ 
ject’s  appearance  depends  not  only  on  its  three  dimen¬ 
sional  shape,  but  also  on  the  object’s  albedo,  the  view¬ 
point  from  which  it  is  imaged  and  a  number  of  other 
factors.  It  is  non-trivial  to  separate  the  different  factors 
constituting  an  object’s  appearance.  For  example,  it  is 
usually  not  possible  to  separate  an  object’s  three  dimen¬ 
sional  shape  from  the  other  factors. 

The  Center  has  overcome  this  difficulty  by  develop¬ 
ing  methods  to  retrieve  objects  using  their  appearance 
[26,  27,  19,  25].  The  methods  involve  finding  objects 
similar  in  appearance  to  an  example  object  specified  by 
the  query. 

To  the  best  of  our  knowledge,  ours  is  the  first  gen¬ 
eral  query  by  appearance  image  retrieval  system.  Sys¬ 
tems  have  been  built  to  retrieve  specific  objects  like  faces 
(e.g., [29]). However, these systems require a number of
training  examples  and  it  is  not  clear  whether  they  can  be 
generalized  to  retrieve  other  objects. 

Some  of  the  salient  features  of  our  system  include: 

1. The ability to retrieve “similar” images. This is in
contrast with techniques which try to recover the
same object. In our system, a car used as a query
will also retrieve other cars rather than retrieving
only cars of a specific model.

2.  The  ability  to  retrieve  images  embedded  in  a  back¬ 
ground  (see  for  example  the  cars  in  Figure  13  which 
appear  against  various  backgrounds). 

3.  It  does  not  require  any  prior  manual  segmentation 
of  the  database. 

4.  No  training  is  required. 

5.  It  can  handle  a  range  of  variations  in  size. 

6.  It  can  handle  3D  viewpoint  changes  up  to  about  20 
to  25  degrees. 

The  user  constructs  the  query  by  taking  an  example 
picture,  and  marking  regions  which  she  considers  impor¬ 
tant  aspects  of  the  object.  The  query  may  be  refined  later 
depending on the retrieval results. Consider, for example,
the first car shown in Figure 12. The user marks the
region  shown  in  the  figure  using  a  mouse.  Notice  that 
the  region  reflects  the  fact  that  wheels  are  central  to  a 
car.  The  user’s  query  in  this  situation  is  to  find  visually 
similar  objects  (i.e.,  other  cars)  from  a  similar  viewpoint 
(where  the  viewpoint  can  vary  up  to  25  degrees  from  the 
query). 

The  database  images  are  filtered  with  derivatives  of 
Gaussians  at  multiple  scales.  Derivatives  of  the  first  and 
second  order  are  used.  Differential  invariants  (invariants 
to 2D rotation) are created using the derivatives [19, 25].
An  inverted  list  is  constructed  from  these  invariants.  The 
inverted  list  is  indexed  using  the  value  of  each  invariant. 
The  entire  computation  may  be  carried  out  off-line. 
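For orientation, the following are standard examples of 2D rotation invariants built from first and second order Gaussian derivatives $L_x, L_y, L_{xx}, L_{xy}, L_{yy}$ of an image at a given scale. This section does not state which combinations the system actually uses (see [19, 25] for that); the set below and the labels $d_1, \dots, d_4$ are given only to illustrate what such invariants look like.

```latex
\begin{align*}
d_1 &= L_x^2 + L_y^2
      && \text{(squared gradient magnitude)} \\
d_2 &= L_{xx} + L_{yy}
      && \text{(Laplacian)} \\
d_3 &= L_{xx} L_x^2 + 2 L_{xy} L_x L_y + L_{yy} L_y^2
      && \text{(second derivative along the gradient, unnormalised)} \\
d_4 &= L_{xx}^2 + 2 L_{xy}^2 + L_{yy}^2
      && \text{(squared Frobenius norm of the Hessian)}
\end{align*}
```

Each expression is unchanged when the image is rotated in the plane, because it depends only on the gradient vector and the Hessian matrix through rotation-invariant combinations (norms, traces and quadratic forms).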

The  on-line  computation  consists  of  calculating  invari¬ 
ants  for  points  in  the  query  (which  is  a  region  in  the  im¬ 
age).  Points  with  similar  invariant  values  are  now  re¬ 
covered  from  the  database  by  indexing  on  the  invariant 
values.  The  points  obtained  by  indexing  must  also  sat¬ 
isfy  certain  spatial  constraints.  That  is,  the  values  of 
the  invariants  at  a  pixel  and  at  some  of  its  neighbors 
must  match.  This  ensures  that  the  indexing  scheme  pre¬ 
serves  the  spatial  layout  of  objects.  Points  which  satisfy 
this  spatial  relationship  vote  and  the  database  images  are 
ranked  on  the  basis  of  this  vote. 
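The indexing and voting scheme can be sketched as follows. This is a simplified illustration and not the system described in [19, 25]: invariant vectors are matched by quantising them into bins, and the spatial constraint is modelled by requiring that at least one other query point, at the same relative offset, also matches in the database image. The data layout, bin width `step` and `min_neighbours` parameter are assumptions made for the example.

```python
import math
from collections import defaultdict

def quantize(invariants, step=0.5):
    """Map a vector of invariant values to a tuple of integer bin indices."""
    return tuple(math.floor(v / step) for v in invariants)

def build_index(database, step=0.5):
    """Off-line step: build the inverted list.

    database: {image_id: {(x, y): invariant_vector}}
    returns:  {bin_key: [(image_id, (x, y)), ...]}
    """
    index = defaultdict(list)
    for image_id, points in database.items():
        for xy, vec in points.items():
            index[quantize(vec, step)].append((image_id, xy))
    return index

def query_votes(index, database, query_points, step=0.5, min_neighbours=1):
    """On-line step: look up query points and count spatially consistent hits.

    query_points: {(x, y): invariant_vector} for the user-marked region.
    Returns [(image_id, votes), ...] sorted by decreasing vote count.
    """
    votes = defaultdict(int)
    for (qx, qy), qvec in query_points.items():
        for image_id, (x, y) in index.get(quantize(qvec, step), []):
            agree = 0
            for (nx, ny), nvec in query_points.items():
                if (nx, ny) == (qx, qy):
                    continue
                # Same offset from the matched point in the database image.
                other = database[image_id].get((x + nx - qx, y + ny - qy))
                if other is not None and quantize(other, step) == quantize(nvec, step):
                    agree += 1
            if agree >= min_neighbours:
                votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: -kv[1])
```

Hard quantisation is used here only for brevity; values that fall on either side of a bin boundary will fail to match, which a tolerance-based lookup would avoid.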

The  scheme  described  above  works  if  the  object  is 
roughly  the  same  size  in  the  query  and  the  image 
database.  In  practice  it  is  quite  common  for  the  objects 
to  be  of  different  sizes  in  a  database.  The  variation  in 
size  is  handled  by  doing  a  search  over  scale  space.  That 
is, the query is filtered with Gaussian derivatives of
different standard deviations [14, 13, 12] and the image
simultaneously warped. This allows objects over a range
of sizes to be matched [26, 27].

The query is outlined by the user with a mouse (Figure
12). Figure 13 shows the results of a query. Notice that
a large number of cars with white wheels have been
retrieved. For more examples, see [19, 25]. This retrieval
was performed on a database of 1600 images taken from
the Internet, the Library of Congress and other sources.
The database consists of faces, monkeys, apes, cars,
diesel and steam locomotives and a few houses. Lighting
and camera parameters are not known.

Figure 12: Car Query for retrieval by indexing

5  Conclusion 

This  paper  has  described  the  multimedia  indexing  and 
retrieval  work  being  done  at  the  Center  for  Intelligent  In¬ 
formation  Retrieval.  Work  on  systems  for  finding  text 
in  images,  indexing  archival  handwritten  documents  and 
image  retrieval  by  content  has  been  described.  The  re¬ 
search  described  is  part  of  an  on-going  research  effort 
focused  on  indexing  and  retrieving  multimedia  informa¬ 
tion  in  as  many  ways  as  possible.  The  work  described 
here  has  many  applications,  principally  in  the  creation  of 
the  digital  libraries  of  the  future. 